ABSTRACT


The  cost  of  system  operational  testing  is  steadily 
increasing.  It  is  desirable  for  the  software  manager  to  know 
if  the  software  is  sufficiently  well  developed  or  reliable  to 
support such testing. Current software reliability models
provide only point estimates of the mean time to next failure
or expected number of errors to occur in additional testing
time.


The  goal  of  this  thesis  is  to  take  into  account  prediction 
uncertainties  of  a  software  reliability  model.  Bootstrapping 
is  used  to  provide  the  software  manager  with  confidence  limits 
of the predicted expected number of faults to occur for
additional  testing  time.  The  results  can  be  particularly 
useful  to  a  software  manager  who  has  to  answer  a  subjective 
question:  is  the  software  reliable  enough  to  support  system 
operational  testing?  A  range  of  predicted  expected  number  of 
faults  will  be  of  more  use  to  a  software  manager,  who  has  to 
justify  the  answer  to  this  question,  than  just  a  point 
estimate.  Two  software  fault  data  sets  are  analyzed  with  this 
I. INTRODUCTION

Prior  to  costly  operational  testing  of  a  system  consisting 
of  hardware  and  its  embedded  software,  it  would  be  highly 
desirable to know whether these two major components are
sufficiently  reliable  to  support  such  testing.  Specifically, 
this  is  equivalent  to  asking  whether  the  software  has  reached 
a  state  of  maturity  such  that  unforeseen  faults  (bugs,  errors, 
system  crashes,  etc.)  are  not  likely  to  occur  during 
operational  test  of  the  entire  system,  or  later,  during  a 
systemic  mission. 

Estimation of hardware reliability is relatively well
understood. Unfortunately, software reliability or maturity
prediction is not as well understood at this time. The
ANSI/IEEE  definition  of  software  reliability  is  the  ability  of 
a  program  to  perform  a  required  function  under  stated 
conditions for a stated period of time (IEEE, 1984). Since
testing software has an associated cost, whether in
computer run time, labor costs, lost market share resulting
from late delivery of a product or, in the case of military
equipment, sacrificed range-testing time and aborted missions,
there is a finite time allocated for testing and removal of
faults (bugs). A moderate-sized program with 264 branches
would have as many as $2^{264}$ independent paths (greater than
the estimated number of atoms in the universe). Obviously, it is infeasible
to test each path (Dalal and Mallows, 1988). Testing and
debugging  costs  are  estimated  to  range  from  50%  to  80%  of  the 
costs  for  development  of  a  working  version  of  software 
(Beizer, 1984). The constraints of a finite time period for
testing and the cost of testing are excellent incentives for
prompt and accurate determination of software reliability.

Put  in  the  form  of  a  question:  when  can  testing  be  stopped 
and  the  product  delivered  with  a  high  level  of  confidence  that 
the  customer  will  be  satisfied? 

Software  reliability  estimation  is  based  on  the  results  of 
testing.  Software  testing  can  be  broken  down  into  four  major 
categories:  unit,  integration,  system  and  regression  testing. 

Unit  testing  is  usually  done  by  the  programmer  in  an  informal 
manner.  Integration  testing  is  done  in  an  orderly  progression 
such  that  the  software  elements  are  combined  and  tested  until 
the  entire  software  package  has  been  tested.  System  testing 
is  integration  of  hardware  and  software  to  verify  that  the 
system  meets  specified  requirements.  Regression  testing  is 
retesting  to  detect  faults  that  may  have  been  introduced 
during program modification (Hernandez, 1989). One purpose of
testing is to produce quantitative measures of software
error-proneness after effort has been expended in the integration
testing, system testing, and fault removal phases.

Software testing, a follow-on to hardware reliability
prediction, has been of considerable importance and interest
from the mid-1960s to the present. The Navy's Operational
Test and Evaluation Force recently (January 1992) held a
symposium  for  DoD  agencies  to  discuss  and  exchange  ideas  and 
methodologies  on  software  testing  and  reliability.  There  are 
two  basic  differences  between  hardware  and  software 
reliability  predictions.  Hardware  prediction  usually  assumes 
independence  of  failures,  and,  after  some  point,  the 
reliability  measuring  process  does  not  affect  the  failure 
rate. Software reliability prediction models should assume
interdependence  of  unit  failures,  and  that  testing  improves 
reliability.  Removing  a  program  fault  or  bug  during 
developmental  testing  reduces  the  likelihood  that  a  fault  will 
become  operative  later  in  an  operational  setting  that  will 
cause a mission to abort. The software fault-prevalence and
appearance prediction problem has been judged to be inherently
more difficult than hardware reliability prediction (Beizer,
1984).

There  are  several  software  reliability  models  that  will  be 
discussed  later.  Beizer  in  his  seminal  work  Software  System 
Testing  and  Quality  Assurance  (Beizer,  1984)  summed  up  the 
similarities of the models best:

1.  Most  models  assume  a  fixed  but  unknown  number  of 
faults  when  testing. 

2. Faults are universally assumed to be independent (some
of the later models, Schneidewind's Software Reliability
Model for example, do not necessarily make this
assumption).

3. Most models assume perfect debugging. That is, the
debugging process introduces no new faults. However, some
of the later models take into account that not all
detected faults will be fixed, and that the debugging
process itself may introduce new faults (Littlewood and
Verrall's Bayesian Reliability Growth Model takes into
account imperfect debugging).

4.  Most  models  assume  that  test  time  and  calendar  time 
are  the  same. 

5. The models assume that failure rate is proportional to
the faults remaining. This implicitly means that faults
are assumed to cause single failures and each failure can
be related to one fault.

6.  The  models  assume  path  homogeneity.  That  is,  data  are 
entered  randomly  and  such  data  uniformly  exercise  all 
code. This is in direct contradiction to the reality that
most paths cover a small percentage (say under 10%) of
the code.

The difference between the models lies in the degree to
which these assumptions hold true, i.e., the type of random
process according to which the failures occur, and how data are
fitted to the models (Beizer, 1984).

The  models  that  are  described  in  Chapter  II  do  not 
necessarily  perform  well  for  all  types  of  data.  There  is  no 
"silver  bullet"  (Brooks,  1986)  that  will  take  on  all  comers 
successfully.  One  model  may  predict  reliability  well  for  one 
data  source  but  not  another.  The  users  of  the  models  must 
take  into  consideration  the  predictive  quality  of  a  model 
prior to basing decisions on the output of the model (Abdalla
et al., 1986) and (Goel, 1985). One possible way to do this is
to analyze the data using various models. The manager selects
the model that demonstrates the best predictive qualities,
i.e., the model that appears to best fit the data and provide
useful results. The choice is difficult because it is
conducted in an atmosphere of uncertainty.

Our hypothesis is that software reliability can be
predicted, but with error. It is important to take account of
the variabilities and uncertainties that are inevitably
present, at least those associated with sampling (finite
data); the most serious errors, however, may be associated
with model choice. To test this hypothesis of predictability we
analyze sources of fault (error or bug) data using a
modification of the BELLCORE MODEL (Dalal and Mallows, 1988)
to estimate the reliability of the particular software project
and the quality of the prediction produced by the model.
Parametric estimates are made by maximum likelihood but also
by use of an approximate Bayesian technique. Error estimates
are made by a resampling technique known as bootstrapping.

The parametric bootstrap technique was used in the
aftermath of the Challenger disaster to analyze the O-rings
that failed. Although the analysis was done on hardware, the
methodology that we propose in Chapter III and the appendix is
similar. The analysis of the O-rings showed bootstrap 90%
confidence limits for the expected catastrophic failure rate of at
least 13% at a temperature of less than 31 degrees, but less
than a 2% failure rate at temperatures above 60 degrees (Dalal
et al., 1989). Had the NASA decision makers had this
information available to them, the option of postponing
the launch might have been taken more seriously and the disaster
prevented. The analogous question for the software manager to consider
is: is the predicted number of faults to occur in some specified
time acceptable? It is hoped that the wrong decision will
not have consequences as severe as the Challenger disaster.
The techniques that we describe provide a quantitative tool
for the software manager to substantiate the decision to
schedule (or postpone) system operational testing.

In  Chapter  II,  we  briefly  describe  several  software 
reliability  prediction  models  that  have  been  proposed  in  order 
to  provide  a  basis  of  understanding  of  the  discussion.  In 
Chapter  III  and  the  appendix,  we  present  the  model  fitting 
procedure,  the  method  used  to  determine  the  quality  of  the 
prediction,  the  resulting  data  obtained  from  the  analysis,  and 
methods  to  improve  this  methodology  from  the  perspective  of  a 
software  manager.  In  Chapter  IV,  our  conclusions  are  provided 
and  directions  for  future  research  are  suggested. 
II. SURVEY OF SOFTWARE RELIABILITY METHODOLOGIES

This survey is concerned with only two categories of
software reliability models: those for time between errors
(TBE), and those for fault count (number of errors in a specified
time).

A.  TIME  BETWEEN  ERRORS  (TBE) 

TBE reliability assessments attempt to predict the mean
time between failures (MTBF) for the ith failure based on that
for the (i-1)th failure. The TBE can be measured in either
central processing unit (CPU) time or wall-clock time.
Wall-clock time can be misleading: it elapses regardless of
whether or not the program is running. From this information
the software manager can gain confidence that the software
will exhibit the operational capability to complete its
mission: to operate without failure for a mission time. A
system that experiences multiple, severe software errors that
prevent the system from completing its operational mission is
not ready for costly live exercises as in operational testing.
For example, a system that is supposed to detect, track and
engage a missile during a scenario of five minutes' duration,
but whose software experiences a severe fault every thirty
seconds on average, is obviously not ready to conduct an
expensive live exercise or actual mission. Here are some
models that attempt to predict (mean or average) time to
failure.

1. Jelinski and Moranda Model

Jelinski and Moranda developed the "De-Eutrophication
Model" (Moranda and Jelinski, 1972), (Farr, 1983). The
assumptions are:

• The rate of fault detection is proportional to the current
fault content of a program.

• All faults are equally likely to occur and are independent
of each other.

• Each fault is of the same severity as any other fault.

• The fault rate remains constant over the interval between
fault occurrences.

• The software is operated in a manner similar to
anticipated operational usage.

• The faults are corrected instantly, without introduction
of new faults into the program.

The hazard rate for the ith fault is

$$ z_i(t) = \theta [N - (i-1)] , \tag{2.1} $$

where: N = total number of faults initially in the system
i = ith fault to occur
$\theta$ = proportionality constant.

$X_i = t_i - t_{i-1}$ is the time between the ith and the (i-1)st fault
and is assumed to have an exponential distribution with rate
$z_i(t_i)$:

$$ f(x_i) = \theta [N - (i-1)] \exp\{ -\theta [N - (i-1)] x_i \} . \tag{2.2} $$
The likelihood function for the parameters $\theta$ and N is

$$ L(x_1, \ldots, x_n) = \prod_{i=1}^{n} \theta [N - (i-1)] \exp\{ -\theta [N - (i-1)] x_i \} . \tag{2.3} $$


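Taking the partial derivatives of ln(L) with respect to N (N
is allowed to assume any real value as a convenient
approximation) and $\theta$, and then setting the equations equal to
zero, the solutions of the following set of equations are
obtained as maximum likelihood estimates for N and $\theta$ (N is
estimated by numerical techniques, then used to solve for $\theta$):

$$ \hat{\theta} = \frac{n}{\hat{N} \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} (i-1) x_i} , \tag{2.4} $$

$$ \sum_{i=1}^{n} \frac{1}{\hat{N} - (i-1)} = \frac{n}{\hat{N} - \frac{1}{\sum_{i=1}^{n} x_i} \sum_{i=1}^{n} (i-1) x_i} . \tag{2.5} $$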


The estimate for the mean time between failure (MTBF) for the
(i+1)st fault occurrence is

$$ \widehat{\mathrm{MTBF}} = \frac{1}{\hat{z}(t_i)} = \frac{1}{\hat{\theta} (\hat{N} - i)} . \tag{2.6} $$

The data required to use the Jelinski-Moranda model are the
observed times of the fault occurrences (the $t_i$'s), or the times
between the faults (the $x_i$'s).
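The numerical step mentioned above (solve for N, then for the proportionality constant) can be sketched as follows. This is a minimal illustration, not the thesis's own procedure: the function names `jm_mle` and `jm_mtbf`, the bisection solver, and the sample data are assumptions made here. It relies on the standard form of the Jelinski-Moranda likelihood equations, in which a finite estimate of N exists only when the data show reliability growth.

```python
def jm_mle(x, tol=1e-10):
    """Jelinski-Moranda MLEs (N_hat, theta_hat) from inter-failure times x_1..x_n.

    Hypothetical helper: solves the likelihood equation (2.5) for N by
    bisection, then evaluates (2.4) for theta.
    """
    n = len(x)
    total = sum(x)
    # sum over i of (i-1) * x_i; enumerate() starts at 0, which is (i-1)
    weighted = sum(k * xk for k, xk in enumerate(x))
    c = weighted / total
    if c <= (n - 1) / 2.0:
        raise ValueError("no finite MLE: the data show no reliability growth")

    def g(N):
        # left side minus right side of the likelihood equation (2.5)
        return sum(1.0 / (N - k) for k in range(n)) - n / (N - c)

    lo = (n - 1) + 1e-9      # g tends to +infinity as N approaches n-1 from above
    hi = 2.0 * n
    while g(hi) > 0:         # widen the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    N_hat = 0.5 * (lo + hi)
    theta_hat = n / (N_hat * total - weighted)   # equation (2.4)
    return N_hat, theta_hat


def jm_mtbf(N_hat, theta_hat, i):
    """Estimated mean time to the (i+1)st fault after i faults, equation (2.6)."""
    return 1.0 / (theta_hat * (N_hat - i))


if __name__ == "__main__":
    # Illustrative data only: times between faults that lengthen as testing proceeds.
    times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    N_hat, theta_hat = jm_mle(times)
    print(N_hat, theta_hat, jm_mtbf(N_hat, theta_hat, len(times)))
```

Bisection is used only because it needs no derivative and cannot leave the bracket; any one-dimensional root finder would serve. Note that the estimate of N can fall between n-1 and n, in which case (2.6) is not meaningful.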

2. Schick-Wolverton Model

The hazard rate for the Schick-Wolverton model
(Schick and Wolverton, 1978) and (Farr, 1983) is proportional
to the number of faults in the program and the amount of
testing time. An assumption of the model is that as more
testing is completed the probability of detecting faults
increases because of "zeroing in" on the areas of code where
the errors lie. The assumptions are:

•  The  rate  of  fault  detection  is  proportional  to  the  current 
fault  content  and  to  the  amount  of  time  expended  in 
testing. 

•  All  faults  are  equally  likely  to  occur 

•  All  faults  are  independent  of  each  other 

•  All  faults  are  of  the  same  severity 

•  The  software  is  operated  in  a  manner  similar  to  the 
anticipated  operational  usage 

•  Perfect  fault  correction  occurs. 

The hazard function is

$$ z(x_i) = \theta [N - (i-1)] x_i , \tag{2.7} $$

where: $x_i$ = the amount of time spent testing between the
occurrence of the ith and the (i-1)st fault
N = total number of faults initially in the program
$\theta$ = proportionality constant.

The reliability function of $X_i$ is

$$ R(x_i) = \exp\{ -\theta [N - (i-1)] x_i^2 / 2 \} . \tag{2.8} $$
The density function of $X_i$ is

$$ f(x_i) = -R'(x_i) = \theta [N - (i-1)] x_i \exp\{ -\theta [N - (i-1)] x_i^2 / 2 \} . \tag{2.9} $$

If $X_i^2/2$ is replaced by $Y_i$ the model is formally identical to
the Jelinski-Moranda model previously described. In fact,
substitution of any known function of $X_i$ allows transformation
to the Jelinski-Moranda model. N and $\theta$ are estimated by
MLE's:


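$$ \hat{\theta} = \frac{2n}{\sum_{i=1}^{n} [\hat{N} - (i-1)] x_i^2} , \tag{2.10} $$

$$ \sum_{i=1}^{n} \frac{1}{\hat{N} - (i-1)} = \frac{\hat{\theta}}{2} \sum_{i=1}^{n} x_i^2 . \tag{2.11} $$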


The estimate for the mean time between failure (MTBF) for the
(i+1)st occurrence is

$$ \widehat{\mathrm{MTBF}} = \sqrt{ \frac{\pi}{2 \hat{\theta} (\hat{N} - i)} } . \tag{2.12} $$

The data requirements are the time of the fault occurrence, $t_i$,
or the time between the ith and (i-1)st fault.
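The reduction to the Jelinski-Moranda form, and the mean time to the next fault, can both be checked from the reliability function (2.8). The short derivation below is a sketch added for illustration; the shorthand a (not in the original) stands for the constant that multiplies $x_i$ in the hazard function (2.7).

```latex
% Shorthand (assumption of this sketch): a = \theta [N - (i-1)].
\begin{align*}
P(Y_i > y) &= P\!\left(X_i > \sqrt{2y}\right)
            = R\!\left(\sqrt{2y}\right)
            = \exp\{-a\,y\}, \\
E[X_i] &= \int_0^\infty R(x)\,dx
        = \int_0^\infty \exp\{-a x^2/2\}\,dx
        = \sqrt{\frac{\pi}{2a}} .
\end{align*}
```

The first line shows that $Y_i = X_i^2/2$ is exponential with rate a, which is exactly the Jelinski-Moranda form. The second line gives the mean of $X_i$; after i faults the constant is $a = \theta (N - i)$.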

3.  Geometric  Model 

The Geometric model (Moranda, 1975) and (Farr, 1983)
is a modification of the Jelinski-Moranda "De-Eutrophication"
model. It differs from that model as follows: it does not
assume a fixed number of faults in the program, and the faults
are not equally likely to occur because as debugging
progresses faults become harder to detect. The assumptions
are:


•  There  is  an  infinite  number  of  total  faults  (the  program 
is  never  totally  fault  free) . 

•  All  faults  do  not  have  the  same  chance  of  detection. 

•  Detections  of  faults  are  independent. 

•  The  software  is  operated  in  a  manner  similar  to 
anticipated  operational  usage. 

•  The  fault  detection  rate  forms  a  geometric  progression  and 
is  constant  between  faults. 


The hazard rate for the ith fault is

$$ z_i(t) = D \theta^{i-1} , \tag{2.13} $$

where: $t_i$ = time between the ith and the (i-1)th fault
D = initial hazard rate
$\theta$ = fault detection rate ($0 < \theta < 1$)
i = the ith fault to occur.

$X_i$ = time between the ith and the (i-1)st fault. The $X_i$ are
independently and exponentially distributed with rate $z_i(t)$,
so the density function of $X_i$ is

$$ f(x_i) = D \theta^{i-1} \exp\{ -D \theta^{i-1} x_i \} . \tag{2.14} $$

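D and $\theta$ are estimated by MLE's:

$$ \hat{D} = \frac{\hat{\theta} n}{\sum_{i=1}^{n} \hat{\theta}^{i} x_i} , \tag{2.15} $$

$$ \frac{\sum_{i=1}^{n} i \hat{\theta}^{i} x_i}{\sum_{i=1}^{n} \hat{\theta}^{i} x_i} = \frac{n+1}{2} . \tag{2.16} $$

Equation (2.16) is solved for $\hat{\theta}$, and that value is substituted
into (2.15) to find $\hat{D}$. From these equations the MTBF until
the (n+1)st fault occurs after n faults have occurred can be
obtained:

$$ \widehat{\mathrm{MTBF}} = \frac{1}{\hat{D} \hat{\theta}^{n}} . \tag{2.17} $$

The data requirements are the time of the ith fault ($t_i$), or
the time between the faults ($x_i$), for i = 1, 2, ..., n.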
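The two-step estimation described above (solve one equation for the detection-rate parameter, then substitute to obtain D) can be sketched numerically. The code below is an illustration under stated assumptions: it takes the hazard rate for the ith fault to be D times the detection-rate parameter raised to the power i-1, uses the standard maximum likelihood equations that follow from that form, and the names `geometric_mle` and `geometric_mtbf` are hypothetical.

```python
def geometric_mle(x, tol=1e-12):
    """Geometric-model MLEs (D_hat, theta_hat) from inter-failure times x_1..x_n."""
    n = len(x)
    target = (n + 1) / 2.0

    def h(theta):
        # weighted mean of the index i with weights theta**i * x_i:
        # the left side of the likelihood equation (2.16)
        w = [theta ** i * xi for i, xi in enumerate(x, start=1)]
        return sum(i * wi for i, wi in enumerate(w, start=1)) / sum(w)

    if h(1.0) <= target:
        raise ValueError("no root in (0, 1): the data show no reliability growth")
    lo, hi = 1e-9, 1.0        # h is increasing in theta, so bisection applies
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    theta_hat = 0.5 * (lo + hi)
    # equation (2.15)
    D_hat = theta_hat * n / sum(theta_hat ** i * xi for i, xi in enumerate(x, start=1))
    return D_hat, theta_hat


def geometric_mtbf(D_hat, theta_hat, n):
    """Mean time to the (n+1)st fault: reciprocal of the hazard rate D * theta**n."""
    return 1.0 / (D_hat * theta_hat ** n)


if __name__ == "__main__":
    # Illustrative data only.
    times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    D_hat, theta_hat = geometric_mle(times)
    print(D_hat, theta_hat, geometric_mtbf(D_hat, theta_hat, len(times)))
```

As with the Jelinski-Moranda sketch, a root in (0, 1) exists only when the later inter-failure times tend to be longer than the earlier ones.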

4.  Use  of  Time  Between  Errors  (TBE)  Models 

The TBE for models in this category can be measured in
either wall-clock time or CPU time. The models may be used to
predict the expected time to the next failure. Confidence
limits on the expected value should be used to obtain a range
of time to the next failure. The software manager should be
asking: is the expected time to the next failure longer
than the time required for operational testing of the software
within the overall system? If the time required for
operational testing of the system is greater than the mean
time to failure for the (i+1)th failure then the prudent
software manager should consider postponing operational
testing in favor of continued developmental activity and
testing.

B.  FAULT  COUNT  MODELS 

Fault count models use the number of faults that occurred
in a testing interval to determine the expected number of
faults in the next testing interval. Software managers can
employ this method by simply counting the number of faults in
a given test period (e.g., day, week, or month), provided test
exposures are the same. This provides insight into how well
the testing process is working.

1.  Generalized  Poisson  Model 

The Generalized Poisson Model (Schafer et al., 1979),
(Farr, 1983) is similar to the Jelinski-Moranda and
Schick-Wolverton models but uses fault count observations in fixed,
equal-length intervals rather than times between faults. The
assumptions are:

• The expected number of faults occurring in any time
interval is proportional to the fault content (number of
bugs remaining) at the time of testing, and to the amount
of time that has been previously spent in testing. The
actual number of faults that appear is assumed to be
Poisson distributed.

•  All  faults  are  equally  likely  to  occur  and  are  independent 
of  each  other. 

•  Each  fault  is  of  the  same  severity. 

•  The  software  is  operated  in  a  manner  similar  to  the 
anticipated  operational  usage. 


•  The  faults  are  corrected  at  the  ends  of  the  testing 
intervals.  (Note:  Faults  discovered  in  one  test  interval 
may  be  corrected  at  another  test  interval;  the  only 
restriction  is  that  the  fault  correction  come  at  the  end 
of  the  testing  intervals.) 


Testing intervals are of length $x_i$, and $f_i$ faults occur during
the ith interval. At the end of the ith interval a total of
$M_i$ faults have been corrected.

The expected number of faults in the ith interval is

$$ E(f_i) = \theta [N - M_{i-1}] g_i(x_1, x_2, \ldots, x_i) , \tag{2.18} $$

where: $\theta$ = proportionality constant
N = initial number of faults
$g_i$ = function of the amount of testing time spent
previously and currently; it is nondecreasing, since
as testing progresses more faults are found.

Specifically,

$$ g_i(x_1, x_2, \ldots, x_i) = x_i^{\alpha} , \tag{2.19} $$

where $\alpha$ is assumed known.

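$f_i$ is Poisson with mean $\theta (N - M_{i-1}) g_i$. N and $\theta$ are estimated by
MLE's:

$$ \hat{\theta} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} (\hat{N} - M_{i-1}) x_i^{\alpha}} , \tag{2.20} $$

$$ \sum_{i=1}^{n} \frac{f_i}{\hat{N} - M_{i-1}} = \hat{\theta} \sum_{i=1}^{n} x_i^{\alpha} . \tag{2.21} $$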
These non-linear equations must be solved for $\hat{\theta}$ and $\hat{N}$. From
this the expected number of errors in the (n+1)st test
interval can be obtained,

$$ \hat{E}(f_{n+1}) = \hat{\theta} (\hat{N} - M_n) x_{n+1}^{\alpha} , \tag{2.22} $$

where $x_{n+1}$ is the anticipated testing time for the (n+1)st
test interval.

The data requirements for this model are the lengths of the
test intervals ($x_i$), the total number of faults corrected by
the end of each test interval ($M_i$), and the number of faults
discovered in each interval ($f_i$).
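One way to solve the two non-linear equations is to eliminate the proportionality constant and search for N alone. The sketch below does this by bisection. It assumes the Poisson mean stated above with the time function taken as the interval length raised to a known power; the function names, the bracketing strategy, and the sample data are assumptions of this illustration and are not taken from the thesis.

```python
def generalized_poisson_mle(f, m_prev, x, alpha=1.0, tol=1e-10):
    """Return (N_hat, theta_hat).

    f[i]      faults found in interval i
    m_prev[i] total faults corrected before interval i starts (M_{i-1})
    x[i]      length of interval i
    """
    total_faults = sum(f)
    g = [xi ** alpha for xi in x]              # g_i = x_i**alpha, equation (2.19)

    def theta_given(N):
        # equation (2.20) evaluated at a trial value of N
        return total_faults / sum((N - m) * gi for m, gi in zip(m_prev, g))

    def score(N):
        # left side minus right side of (2.21), theta eliminated through (2.20)
        lhs = sum(fi / (N - m) for fi, m in zip(f, m_prev))
        return lhs - theta_given(N) * sum(g)

    base = max(m_prev)                         # N must exceed every M_{i-1}
    lo = base + 1e-9
    if score(lo) <= 0:
        raise ValueError("MLE not bracketed just above max(M)")
    step = 1.0
    while score(base + step) > 0:              # widen the bracket
        step *= 2.0
        if step > 1e9:
            raise ValueError("no finite MLE found: data show no reliability growth")
    hi = base + step
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    N_hat = 0.5 * (lo + hi)
    return N_hat, theta_given(N_hat)


def expected_next_count(N_hat, theta_hat, m_n, x_next, alpha=1.0):
    """Equation (2.22): expected faults in the (n+1)st interval."""
    return theta_hat * (N_hat - m_n) * x_next ** alpha


if __name__ == "__main__":
    # Illustrative data: five unit-length intervals, every fault fixed at interval end.
    f = [10, 8, 6, 4, 2]
    m_prev = [0, 10, 18, 24, 28]
    N_hat, theta_hat = generalized_poisson_mle(f, m_prev, [1.0] * 5)
    print(N_hat, theta_hat, expected_next_count(N_hat, theta_hat, 30, 1.0))
```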

2. Non-homogeneous Poisson Process Model

The Non-homogeneous Poisson Process Model (NHPP) (Goel and
Okumoto, 1979) and (Farr, 1983) assumes that the fault counts
for testing intervals follow a Poisson distribution. The
expected number of faults in the Poisson process model is
proportional to the number of faults left in the program. The
assumptions are:

•  The  software  is  operated  in  a  manner  similar  to  the 
anticipated  operational  usage. 

• The numbers of faults detected, ($f_i$), in any test
interval, ($t_{i-1}$, $t_i$), are independent for any finite
collection of times $t_1 < t_2 < \ldots < t_n$.

•  Faults  are  of  the  same  severity. 

•  Faults  are  equally  likely  to  be  detected. 

• The cumulative number of faults detected at any time t,
(N(t)), has a Poisson distribution with mean m(t). The
mean, m(t), is the expected number of faults to occur for
any time period (0,t), and its rate of increase at time t is
proportional to the expected number of undetected faults at
time t.

• m(t) is bounded.

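The specific mean function used is

$$ m(t) = a (1 - e^{-bt}) , \tag{2.23} $$

and $f_i$, the number of faults in the ith interval, has mean

$$ E(f_i) = m(t_i) - m(t_{i-1}) = a ( e^{-b t_{i-1}} - e^{-b t_i} ) , \tag{2.24} $$

where: a = expected total number of faults to be
eventually detected.

a and b can be estimated by MLE's:

$$ \hat{a} = \frac{\sum_{i=1}^{n} f_i}{1 - e^{-\hat{b} t_n}} , \tag{2.25} $$

$$ \frac{t_n e^{-\hat{b} t_n} \sum_{i=1}^{n} f_i}{1 - e^{-\hat{b} t_n}} = \sum_{i=1}^{n} \frac{f_i ( t_i e^{-\hat{b} t_i} - t_{i-1} e^{-\hat{b} t_{i-1}} )}{e^{-\hat{b} t_{i-1}} - e^{-\hat{b} t_i}} . \tag{2.26} $$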


From the estimates of a and b the expected number of faults in
the next, (n+1)st, test interval is estimated to be

$$ \hat{m}(t_{n+1}) - \hat{m}(t_n) = \hat{a} ( e^{-\hat{b} t_n} - e^{-\hat{b} t_{n+1}} ) . \tag{2.27} $$

The data required for this model are the fault counts of each
test interval ($f_i$) and the end time of each test interval ($t_i$).
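A minimal sketch of the estimation and prediction steps for this model follows. It assumes the exponential mean function m(t) = a(1 - exp(-bt)) and grouped counts over intervals that start at time zero, and it solves the likelihood equation for b by bisection. The function names and the sample data are illustrative assumptions, not part of the original study.

```python
import math


def goel_okumoto_mle(f, t, tol=1e-12):
    """Return (a_hat, b_hat) for counts f[i] observed in (t[i-1], t[i]], with t_0 = 0."""
    total = sum(f)
    ends = list(t)
    starts = [0.0] + ends[:-1]
    t_n = ends[-1]

    def score(b):
        # derivative of the log-likelihood in b after a has been eliminated;
        # it is zero exactly when equation (2.26) holds
        s = 0.0
        for fi, u, v in zip(f, starts, ends):
            eu, ev = math.exp(-b * u), math.exp(-b * v)
            s += fi * (v * ev - u * eu) / (eu - ev)
        return s - t_n * math.exp(-b * t_n) * total / (1.0 - math.exp(-b * t_n))

    lo, hi = 1e-6, 1.0
    if score(lo) <= 0:
        raise ValueError("no positive root: the counts do not decrease over time")
    while score(hi) > 0:          # widen the bracket
        hi *= 2.0
        if hi > 1e3:
            raise ValueError("could not bracket the root")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    b_hat = 0.5 * (lo + hi)
    a_hat = total / (1.0 - math.exp(-b_hat * t_n))   # equation (2.25)
    return a_hat, b_hat


def expected_faults(a_hat, b_hat, t_from, t_to):
    """Expected number of faults in (t_from, t_to], as in equation (2.27)."""
    return a_hat * (math.exp(-b_hat * t_from) - math.exp(-b_hat * t_to))


if __name__ == "__main__":
    # Illustrative data: weekly fault counts over five weeks.
    a_hat, b_hat = goel_okumoto_mle([10, 8, 6, 4, 2], [1, 2, 3, 4, 5])
    print(a_hat, b_hat, expected_faults(a_hat, b_hat, 5, 6))
```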

3. Schneidewind's Software Reliability Model

Schneidewind's model (Schneidewind, 1975) and (Farr,
1983) maintains that as testing progresses the fault detection
process changes. The later faults are therefore more useful
in determining future fault counts. The model allows for
three approaches.

1.  Utilize  all  the  fault  counts  from  the  m  intervals. 

2. The first (s-1) intervals are ignored and only the s
through m interval fault counts are considered.

3. The fault counts of the first (s-1) intervals are summed, and
the fault counts from the remaining s through
m intervals are treated individually. Denote the sum of
the fault counts in the first s-1 intervals by:

$$ F_{s-1} = \sum_{i=1}^{s-1} f_i . \tag{2.28} $$

Method  1  is  used  when  the  analyst  feels  that  all  intervals 
will  be  useful.  Method  2  can  be  used  when  a  significant 
change  in  the  fault  detection  process  has  occurred  at 
approximately the (s-1)st interval. Method 3 attempts to
combine  the  effects  of  both  approaches.  The  assumptions  for 
all  methods  are  the  same: 

•  The  fault  counts  for  each  interval  are  independent  of  each 
other. 

•  The  fault  correction  rate  is  proportional  to  the  number  of 
faults  to  be  corrected. 

•  The  software  is  operated  in  a  manner  similar  to  the 
anticipated  operational  usage. 

•  The  mean  number  of  detected  faults  decreases  from  one 
interval  to  the  next. 

•  Intervals  are  all  of  the  same  length. 


• The rate of fault detection is proportional to the number
of faults remaining. The fault detection process is
assumed to be a non-homogeneous Poisson process with an
exponentially decreasing appearance and detection rate.

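The rate of change of the number of faults detected in the ith
interval is

$$ d_i = \alpha e^{-\beta i} . \tag{2.29} $$

The cumulative mean number of faults that occurs up to and
including interval i is

$$ D_i = \frac{\alpha}{\beta} ( 1 - e^{-\beta i} ) . \tag{2.30} $$

The mean number of faults for the ith interval is

$$ m_i = D_i - D_{i-1} = \frac{\alpha}{\beta} ( e^{-\beta (i-1)} - e^{-\beta i} ) . \tag{2.31} $$

$\alpha$ and $\beta$ can be determined by MLE's:

$$ \hat{\beta} = \ln(y) , \tag{2.32} $$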
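For Method 1, y is the solution to

$$ \frac{1}{y - 1} - \frac{m}{y^{m} - 1} = A , \tag{2.33} $$

and

$$ \hat{\alpha} = \frac{\hat{\beta} F_m}{1 - e^{-\hat{\beta} m}} , \tag{2.34} $$

where:

$$ A = \frac{1}{F_m} \sum_{k=0}^{m-1} k f_{k+1} , \tag{2.35} $$

$$ F_m = \sum_{i=1}^{m} f_i . \tag{2.36} $$

For Method 2, y is the solution to

$$ \frac{1}{y - 1} - \frac{m - s + 1}{y^{m-s+1} - 1} = \frac{1}{F_{s,m}} \sum_{k=0}^{m-s} k f_{s+k} , \tag{2.37} $$

and

$$ \hat{\alpha} = \frac{\hat{\beta} F_{s,m}}{1 - e^{-\hat{\beta} (m-s+1)}} , \tag{2.38} $$

where:

$$ F_{s,m} = \sum_{i=s}^{m} f_i . \tag{2.39} $$

For Method 3, y is the solution to

$$ \frac{(s-1) F_{s-1}}{y^{s-1} - 1} + \frac{F_{s,m}}{y - 1} - \frac{m F_m}{y^{m} - 1} = \sum_{k=0}^{m-s} (s + k - 1) f_{s+k} , \tag{2.40} $$

and

$$ \hat{\alpha} = \frac{\hat{\beta} F_m}{1 - e^{-\hat{\beta} m}} , \tag{2.41} $$

where $F_m$ is the same as in Method 1 and $F_{s,m}$ is the same as in Method
2. From the MLE's of $\alpha$ and $\beta$ the expected number of faults in
the (m+1)st interval is

$$ \hat{m}_{m+1} = \frac{\hat{\alpha}}{\hat{\beta}} ( e^{-\hat{\beta} m} - e^{-\hat{\beta} (m+1)} ) . \tag{2.42} $$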
H 
The time needed to detect a total number of M faults is

$$ \hat{t}_M = \frac{1}{\hat{\beta}} \ln\!\left[ \frac{\hat{\alpha}}{\hat{\alpha} - M \hat{\beta}} \right] . \tag{2.43} $$

The data needed for this model are the fault counts for each
interval and a history of the testing process in order to
determine the interval at which testing procedures may have
altered significantly.
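The two predictions a manager would draw from this model, the expected fault count in a future interval and the test time needed to reach a given cumulative count, can be sketched in a few lines. The code assumes the exponential cumulative mean $(\alpha/\beta)(1 - e^{-\beta i})$ and obtains the time to detect M faults by inverting that expression. The function names and the parameter values are hypothetical and are not estimates from any data set in this study.

```python
import math


def cumulative_mean(alpha, beta, i):
    """Mean number of faults detected up to and including interval i."""
    return alpha / beta * (1.0 - math.exp(-beta * i))


def interval_mean(alpha, beta, i):
    """Mean number of faults detected in interval i alone."""
    return cumulative_mean(alpha, beta, i) - cumulative_mean(alpha, beta, i - 1)


def time_to_detect(alpha, beta, M):
    """Test time (in interval units) at which the cumulative mean reaches M."""
    if M >= alpha / beta:
        # alpha / beta is the limit of the cumulative mean: M can never be reached
        raise ValueError("M exceeds the total number of faults the model expects")
    return math.log(alpha / (alpha - M * beta)) / beta


if __name__ == "__main__":
    alpha, beta = 12.0, 0.35      # illustrative values only
    print(interval_mean(alpha, beta, 6))     # expected faults in the 6th interval
    print(time_to_detect(alpha, beta, 30))   # intervals needed to see 30 faults
```

The guard in `time_to_detect` reflects a property of the model rather than a coding convenience: the cumulative mean is bounded by $\alpha/\beta$, so asking for more faults than that has no finite answer.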

4.  Use  of  Fault  Count  Models 

Fault count models use the number of faults that occur
in some testing interval. The models in this category predict
the expected number of faults to occur in some additional time
interval. Confidence limits on the expected number should be
used to obtain a range of the predicted number of faults to
occur for that time interval. Since there can never be a one
hundred  percent  guarantee  of  perfect  software,  the  software 
manager  should  be  asking:  is  the  predicted  number  of  faults 
to  occur  for  the  time  interval  of  interest  acceptable  for 
operational  testing?  If  the  predicted  number  of  faults  to 
occur  is  too  great  then  the  prudent  software  manager  should 
postpone  operational  testing  in  favor  of  continued 
developmental  activity  and  testing. 

C.  SOFTWARE  RELIABILITY  MODELS 

The number of software reliability models continues to
grow. Assumptions have broadened to reflect the reality of
the software development process with increased accuracy. The
assumptions of some models described appear to be limiting.
The assumption that all faults are of the same severity can be
worked around by modeling faults according to severity. The assumption that
all faults are equally likely to occur and independent of each
other can be relaxed by assuming low severity faults occur
more frequently than high severity faults, but faults of the
same severity class will be considered equally likely to
occur. The assumption of instantaneous fault correction can be avoided by not
counting faults which were previously detected (and counted at
time of initial detection), but were not corrected (Farr,
1983).

Software managers need to be aware of the limitations and
underlying assumptions of the various models that
are available. The data needed to fit the models are
critical to reliable results. The data collection needs to be
an accurate reflection of the meaningful historical testing of
the software. Some of the data that should be collected are
computer usage time, testing intensity, extent of the software
that was tested (was the entire system tested or just a
particular module?), milestones in the software's
development (are requirements changed or added midway through
the development of the software?) and, of course, the cost of
testing.

This study illustrates the use of a particular reliability
model. Some of the specific questions that this thesis
addresses are: How is a software reliability model used?
What type of information does a model require? What kind of
decision can a software manager make based on the results of
the reliability model?

In  today's  fiscal  environment  software  managers  should 
have  a  "warm  fuzzy  feeling"  substantiated  by  quantitative 
results  for  their  product  prior  to  initiating  costly  full 
scale,  live  operational  testing. 




III.  DATA  ANALYSIS 

A.  MODEL  DEVELOPMENT 

The model that is applied in this thesis is based on the
assumption that errors occur according to a
non-stationary Poisson process (NSPP) (Dalal and Mallows, 1988).
The model is identical to the Schneidewind model, and is
fitted according to Method 1, which assumes that all fault
data are of equal value. Let N(t) be the number of faults that
occur in (0,t), where t is software running time. The
probability that n faults occur by time t is
given by

$$ P\{N(t) = n\} = \frac{e^{-\Lambda(t)} (\Lambda(t))^{n}}{n!} , \tag{3.1} $$


where $\Lambda(t) = \lambda(1 - e^{-\mu t})$. A test time, $t_s$, was chosen. This length
of time is divided into periods of length $\Delta = t_s/J$, where $J$ is
the total number of intervals. The $j$th interval is such that
$(j-1)\Delta < t \le j\Delta$. The number of observed counts (faults) in the
$j$th interval is $n_j$. The probability distribution for the
number of faults in $[(j-1)\Delta,\ j\Delta]$ is


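$$P[N_j = N(j\Delta) - N((j-1)\Delta) = n_j] = \frac{e^{-\lambda_j}\,\lambda_j^{n_j}}{n_j!}, \qquad n_j = 0, 1, 2, \ldots \tag{3.2}$$

where

$$\lambda_j = \Lambda(j\Delta) - \Lambda((j-1)\Delta) \tag{3.2a}$$

$$\phantom{\lambda_j} = \lambda\, e^{-\mu (j-1)\Delta}\,(1 - e^{-\mu\Delta}). \tag{3.2b}$$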
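The quantities in (3.1) through (3.2b) can be evaluated directly. The following is a minimal sketch, not the thesis code; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def mean_value(t, lam, mu):
    """Expected cumulative number of faults by time t: lam * (1 - exp(-mu * t))."""
    return lam * (1.0 - math.exp(-mu * t))


def interval_means(lam, mu, delta, J):
    """Expected fault count lambda_j for each interval j = 1..J, as in (3.2b)."""
    return [
        lam * math.exp(-mu * (j - 1) * delta) * (1.0 - math.exp(-mu * delta))
        for j in range(1, J + 1)
    ]


def poisson_pmf(n, mean):
    """Poisson probability of exactly n faults when the expected count is mean."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


# Hypothetical parameter values, for illustration only.
lam, mu, delta, J = 140.0, 0.003, 10.0, 125
means = interval_means(lam, mu, delta, J)

# The interval means telescope to the mean value function at t_s = J * delta.
total = sum(means)
```

Because the $\lambda_j$ telescope, their sum over the $J$ intervals equals $\Lambda(t_s)$, which is the fact used later to eliminate $\lambda$ from the likelihood equations.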

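The parameters $\mu$ and $\lambda$ are estimated by maximum likelihood.
The likelihood function is

$$L(\lambda,\mu) = \prod_{j=1}^{J} \frac{e^{-\lambda_j}\,\lambda_j^{n_j}}{n_j!}. \tag{3.3}$$

Apart from an additive term that does not involve the parameters,
the natural log of $L(\lambda,\mu)$ is

$$l(\lambda,\mu) = \ln(L) = -\sum_{j=1}^{J}\lambda_j + \sum_{j=1}^{J} n_j \ln(\lambda_j). \tag{3.4}$$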


The partial derivatives of $l$ with respect to $\lambda$ and $\mu$ are
taken and set equal to zero. This allows $\lambda$ to be written in
terms of $\mu$ and $n(t_s)$, the total number of counts to occur up
to time $t_s$, as

$$\lambda = \frac{n(t_s)}{1 - e^{-\mu t_s}}. \tag{3.5}$$


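$\lambda$ is substituted into the partial derivative of $l$ with respect
to $\mu$ to give

$$\frac{\partial l}{\partial \mu} = -\frac{n(t_s)\,t_s\,e^{-\mu t_s}}{1 - e^{-\mu t_s}} - \Delta\,\tilde{n}(t_s) + \frac{n(t_s)\,\Delta\,e^{-\mu\Delta}}{1 - e^{-\mu\Delta}} = 0, \tag{3.6}$$

where

$$\tilde{n}(t_s) = \sum_{j=1}^{J} (j-1)\,n_j. \tag{3.7}$$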


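$\mu$ can now be solved for from the following equation:

$$\frac{\Delta\,e^{-\mu\Delta}}{1 - e^{-\mu\Delta}} - \frac{t_s\,e^{-\mu t_s}}{1 - e^{-\mu t_s}} = \frac{\Delta\,\tilde{n}(t_s)}{n(t_s)}. \tag{3.8}$$

This equation closely resembles Schneidewind's result; see
(2.41). Since $t_s = \Delta J$, equation (3.8) becomes

$$\frac{e^{-\mu\Delta}}{1 - e^{-\mu\Delta}} - \frac{J\,e^{-\mu\Delta J}}{1 - e^{-\mu\Delta J}} = \frac{\tilde{n}(t_s)}{n(t_s)}, \tag{3.9}$$

$$\frac{e^{-\mu\Delta}}{1 - e^{-\mu\Delta}} - \frac{J\,(e^{-\mu\Delta})^{J}}{1 - (e^{-\mu\Delta})^{J}} = \frac{\tilde{n}(t_s)}{n(t_s)}. \tag{3.10}$$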


By letting $x = e^{-\mu\Delta}$, equation (3.10) becomes

$$\frac{x}{1-x} - \frac{J\,x^{J}}{1 - x^{J}} = \frac{\tilde{n}(t_s)}{n(t_s)}. \tag{3.11}$$


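$x$ is solved for iteratively. Let $J = 0$ for the first iteration,
so that the second term on the left of (3.11) drops out; then

$$r(1) = \frac{x(1)}{1 - x(1)} = \frac{\tilde{n}(t_s)}{n(t_s)}, \tag{3.12}$$

$$x(1) = \frac{\tilde{n}(t_s)}{n(t_s) + \tilde{n}(t_s)} = \frac{r(1)}{1 + r(1)}. \tag{3.13}$$

$r(2)$ is

$$r(2) = r(1) + \frac{J\,x(1)^{J}}{1 - x(1)^{J}}; \tag{3.14}$$

$x(2)$ is given by

$$x(2) = \frac{r(2)}{1 + r(2)}. \tag{3.15}$$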


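Hence the iteration of $r(n)$ and $x(n)$, with the data term
$r(1) = \tilde{n}(t_s)/n(t_s)$ held fixed as (3.11) requires, is

$$r(n+1) = r(1) + \frac{J\,x(n)^{J}}{1 - x(n)^{J}}, \tag{3.16}$$

$$x(n+1) = \frac{r(n+1)}{1 + r(n+1)}. \tag{3.17}$$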

The iterative process continues until $|x(n+1) - x(n)| < \epsilon$, where
$\epsilon$ is a suitably small number. The converged value gives
$\mu = -\ln x(n+1)/\Delta$, which is then substituted into
equation (3.5) to get $\lambda$. Using the estimates of $\mu$ and $\lambda$, the
expected number of faults to be observed in some additional
operating time $t_o$, where $(t_s, t_s + t_o)$ is of length $k\Delta$, can be
estimated:

$$E[N(t_o) - N(t_s)] = \frac{n(t_s)\,e^{-\mu t_s}\,(1 - e^{-\mu k\Delta})}{1 - e^{-\mu t_s}}. \tag{3.19}$$
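The estimation steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the thesis program: the function names are hypothetical, the iteration keeps the data term $\tilde{n}(t_s)/n(t_s)$ fixed on every pass, and the check uses noise-free counts generated from assumed parameter values.

```python
import math


def fit_mle(counts, delta, tol=1e-12, max_iter=100_000):
    """Return (mu, lam) from interval fault counts via the iteration on x."""
    J = len(counts)
    n = sum(counts)                                      # n(t_s)
    n_tilde = sum(j * c for j, c in enumerate(counts))   # sum of (j-1) * n_j
    r1 = n_tilde / n
    # Equation (3.11) has a root with 0 < x < 1 only when 0 < r1 < (J - 1) / 2.
    if not 0.0 < r1 < (J - 1) / 2.0:
        raise ValueError("counts show no decay; no solution with mu > 0")
    x = r1 / (1.0 + r1)
    for _ in range(max_iter):
        xJ = x ** J
        r = r1 + J * xJ / (1.0 - xJ)
        x_new = r / (1.0 + r)
        done = abs(x_new - x) < tol
        x = x_new
        if done:
            break
    mu = -math.log(x) / delta
    lam = n / (1.0 - math.exp(-mu * J * delta))          # equation (3.5)
    return mu, lam


def expected_additional_faults(n_ts, mu, t_s, t_o):
    """Expected number of faults in (t_s, t_s + t_o), equation (3.19)."""
    return (n_ts * math.exp(-mu * t_s) * (1.0 - math.exp(-mu * t_o))
            / (1.0 - math.exp(-mu * t_s)))


# Noise-free check with assumed (hypothetical) parameters.
true_mu, true_lam, delta, J = 0.01, 100.0, 10.0, 30
counts = [true_lam * math.exp(-true_mu * j * delta) * (1.0 - math.exp(-true_mu * delta))
          for j in range(J)]
mu_hat, lam_hat = fit_mle(counts, delta)
prediction = expected_additional_faults(sum(counts), mu_hat, J * delta, 100.0)
```

When the counts equal their expected values the likelihood equations hold exactly at the generating parameters, so the fit should return them; with real data the iteration converges slowly when the counts decay only weakly over the test period.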


A Bayesian methodology is discussed in the appendix. This
method attempts to utilize past experience from software
projects having characteristics similar to those of the software in
question. If the distributions of $\lambda$ and $\mu$ are known from
experience then this information can be useful in estimating
the parameters $\lambda$ and $\mu$.
B. BOOTSTRAP

Bootstrapping was used to obtain the confidence limits for
$\mu$, $\lambda$, and $E[N(t_o) - N(t_s)] = E[\Delta N(t_o)]$. This technique takes
into account the sampling uncertainties in the estimates by
removing the errors in the standard approximation (Dalal et
al., 1989; Efron, 1985). To obtain the estimates of the
sampling variability of $\mu$, $\lambda$, and $E[\Delta N(t_o)]$,
proceed as follows. The distribution of the counts over the
$J$ periods is taken conditional on $N(t_s) = n(t_s)$:

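$$P\{N_1 = n_1, \ldots, N_J = n_J \mid N_1 + N_2 + \cdots + N_J = n(t_s)\} \tag{3.20a}$$

$$= n(t_s)! \prod_{j=1}^{J} \frac{1}{n_j!}\left(\frac{\lambda_j}{\sum_{k=1}^{J}\lambda_k}\right)^{n_j}, \tag{3.20b}$$

where $\sum_{k=1}^{J}\lambda_k = \lambda(1 - e^{-\mu t_s})$. From this the probability that a count falls
in or before the $j$th interval is

$$P_j = \frac{1 - e^{-j\mu\Delta}}{1 - e^{-\mu t_s}}. \tag{3.21}$$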


Uniform (0,1) random numbers $U_k$ were generated, where $U_k$,
$k = 1, 2, \ldots, n(t_s)$, is the $k$th random number. If $P_{j-1} < U_k \le P_j$
(with $P_0 = 0$) then a count is added to $n_j$. The simulated $n_j$'s were then used to
re-estimate $\mu$, $\lambda$, and $E[\Delta N(t_o)]$; these are the bootstrap
values. This process was repeated 1000 times to get a range
of values for $\mu$, $\lambda$, and $E[\Delta N(t_o)]$. To create a 90% confidence
limit for the estimate of $E[\Delta N(t_o)]$, the 1000 bootstrap estimates
of $E[\Delta N(t_o)]$ were ordered and the 50th and 950th
ordered values were found. These are quoted as the 90% confidence
region $(E[\Delta N(t_o)]_L,\ E[\Delta N(t_o)]_U)$.
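The resampling procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the program used for the tables: the function names, the fixed random seed, and the small set of interval counts are hypothetical, and the estimator is a compact version of the iteration of Section A with the data term held fixed.

```python
import bisect
import math
import random


def fit_mu(counts, delta, tol=1e-10, max_iter=100_000):
    """Estimate mu by fixed-point iteration on x = exp(-mu * delta)."""
    J, n = len(counts), sum(counts)
    r1 = sum(j * c for j, c in enumerate(counts)) / n    # n~(t_s) / n(t_s)
    if not 0.0 < r1 < (J - 1) / 2.0:
        raise ValueError("counts show no decay; no solution with mu > 0")
    x = r1 / (1.0 + r1)
    for _ in range(max_iter):
        xJ = x ** J
        r = r1 + J * xJ / (1.0 - xJ)
        x_new = r / (1.0 + r)
        done = abs(x_new - x) < tol
        x = x_new
        if done:
            break
    return -math.log(x) / delta


def predict(n_ts, mu, t_s, t_o):
    """Expected faults in (t_s, t_s + t_o), equation (3.19)."""
    return (n_ts * math.exp(-mu * t_s) * (1.0 - math.exp(-mu * t_o))
            / (1.0 - math.exp(-mu * t_s)))


def resample_counts(n_ts, mu, delta, J, rng):
    """Scatter n_ts faults over J intervals using the cumulative P_j of (3.21)."""
    denom = 1.0 - math.exp(-mu * J * delta)
    cum = [(1.0 - math.exp(-mu * j * delta)) / denom for j in range(1, J + 1)]
    counts = [0] * J
    for _ in range(n_ts):
        u = rng.random()
        # First index whose cumulative probability is >= u: P_{j-1} < u <= P_j.
        j = min(bisect.bisect_left(cum, u), J - 1)
        counts[j] += 1
    return counts


def bootstrap_limits(counts, delta, t_o, B=1000, seed=0):
    """Return the (lower, upper) 90% bootstrap limits of the predicted faults."""
    rng = random.Random(seed)
    J, n_ts = len(counts), sum(counts)
    t_s = J * delta
    mu_hat = fit_mu(counts, delta)
    preds = sorted(
        predict(n_ts, fit_mu(resample_counts(n_ts, mu_hat, delta, J, rng), delta), t_s, t_o)
        for _ in range(B)
    )
    k = B // 20                      # 5% of B: the 50th value when B = 1000
    return preds[k - 1], preds[B - k - 1]


# Hypothetical interval counts, for illustration only.
observed = [20, 16, 13, 11, 9, 7, 6, 5, 4, 3]
lower, upper = bootstrap_limits(observed, delta=10.0, t_o=50.0)
```

Holding $n(t_s)$ fixed in every replicate mirrors the conditioning in (3.20a); only the allocation of the faults to intervals varies from one bootstrap sample to the next.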

C.  RESULTS 

The estimates for the parameters were obtained using three
different $\Delta$ values and three different $t_s$ values. The value
$t_o$ was selected such that $t_o + t_s$ equals the time of the last
observed fault; this allows for comparison of the predicted expected
number of faults with the observed data. The data
provided in Tables 1 through 6 are the 90% confidence intervals
obtained by the bootstrap.

The most difficult aspect of this
thesis research was obtaining appropriate test data. The
data that I received from various sources was unacceptable for
various reasons: no testing history; severity of faults not
listed; no milestone events listed (e.g., one data set covered
10 years but gave no indication of modifications to the software);
non-software errors listed with software errors; and descriptions
of errors that could not be interpreted (interpretable descriptions
might have eliminated some of the problems mentioned above).
The underlying cause
of this is that the organizations that I contacted for data do not
use any systematic method for determining software
reliability. A "warm fuzzy feeling" for the software seems to
be the current method used to judge the reliability of the
software. This feeling gets warmer and fuzzier as deadlines
draw closer.

The data sets used in the analysis of the model
were obtained from a technical report on other software
reliability models (Abdalla et al., 1986). The data was
given as time (CPU) between failures. The results of the
bootstrap for Data Set 1 are given in Tables 1-3; the
graphical results (Dalal, 1990) are depicted in Figures 1-3.
The results of the bootstrap for Data Set 2 are given in
Tables 4-6; the graphical results (Dalal, 1990) are depicted
in Figures 4-6.

D.  USE  OF  RESULTS 

Suppose a time $t_s$ has been spent testing the software, and
$n(t_s)$ faults were found. The $n(t_s)$ faults can be broken up
into $n_j$'s, the number of faults in each period $j$ of size $\Delta$
($\sum_j n_j = n(t_s)$). This information can be used to estimate the
parameters $\mu$ and $\lambda$, and a point estimate of the mean or
expected number of faults to appear in the time interval
$(t_s, t_s + t_o)$. Operational testing of the system will require some
time $t_o$. Bootstrapping can now be done to assess the sampling
uncertainty in the estimate of the expected number of faults
to appear in $(t_s, t_s + t_o)$. This will be done by quoting
bootstrapped 90% confidence limits. The expected number of
faults predicted to occur can be compared to the requirements
of the system; for example, for some time $t_o$ at most $F$
faults are allowed (suppose $F$ can be specified). If the
predicted expected number of faults is less than the allowable
number of faults then system operational testing might be
worth the expense at this time. In contrast, if the
expected number of faults is greater than the specified number
of faults then system operational testing should be postponed.
Testing should continue in the lab, at the developmental level,
until $t_s$ and $n(t_s)$ are large enough that the expected number of
faults for the required operational time meets specification.

A more conservative approach is to replace the estimate of
the mean number of faults by the upper confidence limit of the
mean number of faults. Such a conservative approach is
recommended.
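The decision rule described above amounts to a single comparison. The sketch below is only an illustration: the function name is hypothetical, the allowance $F = 8$ is an assumed value, and the limits are the extremes quoted in Table 1.

```python
def ready_for_operational_test(upper_limit, allowed_faults):
    """Conservative rule: proceed only if the upper confidence limit is below F."""
    return upper_limit < allowed_faults


# 90% limits for t_s = 1250, t_o = 250 (Table 1); F = 8 is a hypothetical allowance.
lower_limit, upper_limit = 2.21, 6.09
decision = ready_for_operational_test(upper_limit, allowed_faults=8)
```

Using the upper limit rather than the point estimate means that a wide interval, which signals an uncertain prediction, counts against scheduling the test.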

If there are no specifications, the individual responsible
for scheduling system operational testing will have to make a
subjective decision. Is the expected number of faults to
occur in $(t_s, t_s + t_o)$ small enough to warrant spending the money
to carry out system operational testing, or should this
testing be postponed until the expected number of faults is
lower? The assumption is that lab testing will continue on
the software, increasing $t_s$ and $n(t_s)$, but reducing the number
of unfound and uncorrected faults. The more faults found in
lab testing of the software, the fewer faults
are likely to occur in the more costly system operational
testing.

E. APPLICATION TO TWO DATA SETS

The fitting and error assessment procedure was applied to
two data sets (Abdalla et al., 1986). Figures 1, 2, and 3
refer to Data Set 1; Figures 4, 5, and 6 to Data Set 2.

Figure 1 has a $\Delta$ of 10 CPU minutes with three combinations
of $t_s$ and $t_o$. If the range of the expected number of faults
for $t_s = 1250$, $t_o = 250$ (2.21 to 6.09) is acceptable, the software
manager may choose to schedule operational testing. The same
argument can be made for $t_s = 1000$, $t_o = 500$. A problem occurs for
$t_s = 500$ and $t_o = 1000$. If the range for the expected number of
faults to occur (4.69 to 22.22) is acceptable, the software
manager may choose to schedule operational testing.
Unfortunately, 46 faults occur in $(t_s, t_s + t_o)$. This is
extremely likely to be the result of use of an inappropriate
model (it does seem unlikely that software with as many as 22
mission-critical faults would be viewed as acceptable for
starting operational testing). What can the software manager
do to prevent something like this from occurring? Ideally, as
testing continues, the rate at which faults occur should
decrease (assuming a constant relative rate of testing), with
that rate asymptotically approaching zero as $t_s$ becomes large.
The slope of the estimated total expected number of faults
versus test time for Data Set 1 from $T = 300$ to $T = 500$ is $m = 0.08$
(faults/CPU min). Figure 1 depicts this: the rate at which
faults are occurring does not appear to be tapering off. The
software manager can use this information to support a
decision to go ahead with (or postpone) operational testing.
From $T = 1000$ to $T = 1500$ the slope is 0.028 (faults/CPU min) and
appears to be tapering off. The range of the expected number
of faults to occur in the specified $t_o$ accurately reflects what
actually occurred. If the range of the expected number of
faults is acceptable, the software manager should go ahead with
operational testing. Figure 2 ($\Delta$ = 20 CPU minutes) and Figure
3 ($\Delta$ = 50 CPU minutes) can be interpreted similarly.
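The slope check discussed above can be illustrated with the fitted mean value function. This sketch assumes the slope is taken as a secant of $\Lambda(t) = \lambda(1 - e^{-\mu t})$; the function name and the parameter values are hypothetical and are not the estimates behind the slopes quoted in the text.

```python
import math


def secant_slope(lam, mu, t1, t2):
    """Average fault rate (faults per unit time) of lam*(1 - exp(-mu*t)) on [t1, t2]."""
    rise = lam * (math.exp(-mu * t1) - math.exp(-mu * t2))
    return rise / (t2 - t1)


# Hypothetical parameter values, for illustration only.
lam, mu = 140.0, 0.003
early = secant_slope(lam, mu, 300.0, 500.0)
late = secant_slope(lam, mu, 1000.0, 1500.0)
tapering_off = late < early
```

A late slope that is clearly smaller than the early slope is the tapering-off pattern the text looks for before trusting the predicted range.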

The change in $\Delta$ for both data sets did not have a
significant impact on the range of the expected number of
faults to occur, indicating that the model is somewhat
insensitive to the size of $\Delta$.

Data Set 2 (Figures 4, 5, and 6) shows only a small
indication of the slope decreasing. This is why the
confidence limits of the expected number of faults are so wide.
The software manager can apply the same techniques listed
above to make a decision to schedule (or postpone) operational
testing. The software manager must repeatedly address the
questions: is the rate of occurrence of faults lessening, and
is the range of expected number of faults acceptable to
support operational testing?

A fitted model may indicate a narrowing range of expected
number of faults and a slope asymptotically approaching zero;
consequently the software manager schedules operational
testing. Unfortunately, the results of the operational
testing may be poor, i.e., a relatively large number of errors
may occur, indicating that more developmental activity and
testing is required to improve the software. For example, the
model predicts $n(t_o) \approx 22$ for Data Set 1 ($t_s = 500$, $t_o = 1000$), but
the number of observed faults that occurred in $t_o$ was more
than twice the predicted amount, 46. This example illustrates
the relationship between modeling and testing. While a
systematic underestimation indicates flaws in the model,
occasional underestimation simply reinforces that software
reliability models do not take the place of stressing software
within a full system in a real-life operational environment.
The purpose of this thesis is to provide the software manager
with a tool to aid in the decision as to when to initiate
operational testing, not to replace such a test.

TABLE 1

ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 1250$, $t_o = 250$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 6
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00272   0.00176     0.00270   0.00174     0.00270   0.00175
$\lambda$          134.595   146.509     134.993   147.798     135.258   148.142
$E[N(t_o)]$           2.21      5.76        2.32      6.09        2.25      5.82

TABLE 2

ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 1000$, $t_o = 500$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 14
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00298   0.00177     0.00298   0.00176     0.00296   0.00175
$\lambda$          128.701   147.640     128.969   148.393     129.828   150.549
$E[N(t_o)]$           5.03     14.73        5.07     14.81        5.17     14.96

TABLE 3

ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 500$, $t_o = 1000$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 46
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00600   0.00327     0.00600   0.00326     0.00588   0.00317
$\lambda$           95.010   112.711      95.352   113.863      96.859   118.432
$E[N(t_o)]$           4.69     20.97        4.70     21.14        5.00     22.22

TABLE 4

ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 800$, $t_o = 300$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 12
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00288   0.00111     0.00288   0.00111     0.00287   0.00111
$\lambda$           82.479   126.998      82.718   127.645      83.722   131.003
$E[N(t_o)]$           4.75     14.69        4.73     14.64        4.78     14.66

TABLE 5

ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 600$, $t_o = 500$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 21
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00298   0.00068     0.00296   0.00067     0.00298   0.00067
$\lambda$           78.513   195.611      79.189   200.950      80.710   211.307
$E[N(t_o)]$          10.10     37.09       10.21     37.32       10.14     37.43

TABLE 6

ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 400$, $t_o = 700$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 37
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00456   0.00058     0.00458   0.00054     0.00446   0.00047
$\lambda$           58.950   239.964      59.423   263.011      62.014   325.387
$E[N(t_o)]$           9.03     62.43        8.96     63.88        9.45     66.55

Figure 4. Data Set 2, $\Delta$ = 10 (CPU seconds)

Figure 5. Data Set 2, $\Delta$ = 20 (CPU seconds)

Figure 6. Data Set 2, $\Delta$ = 50 (CPU seconds)

IV. CONCLUSION


Software reliability models are useful tools that managers
of software intensive projects have at their disposal. The
bootstrapping technique will provide the manager a range for the
expected number of faults estimated to occur in some
additional operating time. The question is: is the upper
limit of the expected number of faults estimated to occur
acceptable? The potential risks are additional cost for
further testing or late product delivery. The ideal case is
reliable software delivered on time and on budget.
Unfortunately, reality is rarely ideal. The software manager
must decide: is it better to deliver a product on time that
may be considered unreliable by the user and be sent back for
further testing, or to deliver a product late but of
acceptable quality to the user? The purpose of this thesis is
to provide a quantitative tool for the manager who may have to
make such qualitative decisions. The use of software
reliability models is not without associated cost and risk.
The data must be collected for input to the model.
Recommendations for the type of data that should be collected
are:

•  Operating time between failures (CPU time is the best)
(Musa and Okumoto, 1984).

•  Calendar time between failures, although such times may
not accurately reflect the opportunity for faults to
reveal themselves (Musa et al., 1987).

•  Testing history, i.e., how many people are involved in the
testing effort.

•  How the software was tested.

•  Intensity of the software testing.

•  Cost of testing, i.e., the cost to find and repair a fault
before and after product delivery.

Without useful data a reliability model has little
practical use. The model presented in this thesis should be
validated using data from several Navy systems.

There are several areas for further research. How
accurate are the predicted confidence limits in this model?
What are the limits of applicability of this model? What
effect do inaccuracies (due to replacing observed data with
hypothesized data in cases where insufficient data is
available) have on the model, i.e., how robust is the model?
Further development of other software reliability models
should be pursued. Emphasis should be placed on obtaining
confidence limits rather than quoting only a point estimate
of the expected number of failures predicted to appear for
some additional testing time. These models should be verified
using data obtained from Navy software intensive systems. It
is infeasible to test every possible branch in a large program
for faults. The software manager needs technical assistance
in identifying where effort and money should be spent to
deliver the best possible product. Will many faults in
portions of the software that are rarely used/reached cause
more problems for the user than a few faults in frequently
used/reached portions?

APPENDIX


Software projects may have similar characteristics such as
testing strategies or architecture, so that the information
obtained about the reliability of one software project may be
used to aid in the prediction of the reliability of another,
similar software project. This process can make use of
Bayesian methodology (Dalal and Mallows, 1990; Farr, 1983).
If prior distributions of $\lambda$ and $\mu$ are specified then this
information can be used to help estimate the parameters $\lambda$ and $\mu$;
the posterior for these is

$$p_{\lambda,\mu}(\lambda,\mu) = K\,L(\lambda,\mu)\,p_\lambda(\lambda)\,p_\mu(\mu) \tag{a.1a}$$

$$= K\,e^{-\lambda(1 - e^{-\mu\Delta J})}\,\lambda^{n(t_s)}\,e^{-\mu\Delta\tilde{n}(t_s)}\,(1 - e^{-\mu\Delta})^{n(t_s)}\,p_\lambda(\lambda)\,p_\mu(\mu), \tag{a.1b}$$

where $p_\lambda(\lambda)$ and $p_\mu(\mu)$ are the prior distributions of $\lambda$ and $\mu$
estimated from another software project that has
characteristics similar to the software project currently
being tested. The simplest idea is to integrate out $\lambda$ and
marginalize on $\mu$, which yields:

$$p_\mu(\mu) = K\int_0^\infty e^{-\lambda(1 - e^{-\mu\Delta J})}\,\lambda^{n(t_s)}\,p_\lambda(\lambda)\,d\lambda \cdot e^{-\mu\Delta\tilde{n}(t_s)}\,(1 - e^{-\mu\Delta})^{n(t_s)}. \tag{a.2}$$
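The most convenient choice of $p_\lambda(\lambda)$ is the (conjugate) Gamma:

$$p_\lambda(\lambda) = e^{-\alpha\lambda}\,\frac{\alpha(\alpha\lambda)^{\beta-1}}{\Gamma(\beta)}, \tag{a.3}$$

which when substituted into equation (a.2) yields the density

$$p_\mu(\mu) = K\,\frac{\alpha^{\beta}}{\Gamma(\beta)}\,e^{-\mu\Delta\tilde{n}(t_s)}\,(1 - e^{-\mu\Delta})^{n(t_s)}\int_0^\infty e^{-\lambda(\alpha + 1 - e^{-\mu\Delta J})}\,\lambda^{n(t_s)+\beta-1}\,d\lambda \tag{a.4a}$$

$$= K\,\frac{\alpha^{\beta}}{\Gamma(\beta)}\,e^{-\mu\Delta\tilde{n}(t_s)}\,(1 - e^{-\mu\Delta})^{n(t_s)}\,\frac{\Gamma(n(t_s)+\beta)}{(\alpha + 1 - e^{-\mu\Delta J})^{n(t_s)+\beta}} \tag{a.4b}$$

$$= K''\,\frac{e^{-\mu\Delta\tilde{n}(t_s)}\,(1 - e^{-\mu\Delta})^{n(t_s)}}{(\alpha + 1 - e^{-\mu\Delta J})^{n(t_s)+\beta}}. \tag{a.4c}$$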

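Using an uninformative prior, $\alpha = 0$, $\beta = 0$, and setting $x = e^{-\mu\Delta}$,
equation (a.4c) becomes

$$p_x(x) = K''\,\frac{x^{\tilde{n}(t_s)}\,(1-x)^{n(t_s)}}{(\alpha + 1 - x^{J})^{n(t_s)+\beta}}. \tag{a.5}$$

The mode of the density is found from the log density (apart from
an additive constant),

$$l(x) = \ln(p_x(x)) = \tilde{n}(t_s)\ln x + n(t_s)\ln(1-x) - (n(t_s)+\beta)\ln(\alpha + 1 - x^{J}). \tag{a.6}$$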

Taking the derivative of equation (a.6) with respect
to $x$ and setting it equal to zero yields:

$$\frac{x}{1-x} - \frac{n(t_s)+\beta}{n(t_s)}\cdot\frac{J\,x^{J}}{\alpha + 1 - x^{J}} = \frac{\tilde{n}(t_s)}{n(t_s)}. \tag{a.7}$$

If $\alpha = \beta = 0$, equation (a.7) is the same as equation (3.11), which
gives the MLE.

Suppose $m = E[\lambda]$ and $\sigma^2 = \mathrm{Var}[\lambda]$ in the prior; then $\alpha = m/\sigma^2$ and
$\beta = m^2/\sigma^2$. Equation (a.7) is

$$\frac{x}{1-x} - \frac{n(t_s) + m(m/\sigma^2)}{n(t_s)}\cdot\frac{J\,x^{J}}{(m/\sigma^2) + 1 - x^{J}} = \frac{\tilde{n}(t_s)}{n(t_s)}. \tag{a.8}$$

If $\lambda$ is interpreted as the total number of faults in a
particular software project, then the number of faults is
discrete, so a discrete distribution should be used for the
prior; e.g., one could use a Poisson for the prior. However,
it is easier to work with a Gamma distribution. If the Gamma
distribution has the same mean and variance as a Poisson then equation
(a.8) is (since $m = \sigma^2$)

$$\frac{x}{1-x} - \frac{n(t_s) + m}{n(t_s)}\cdot\frac{J\,x^{J}}{2 - x^{J}} = \frac{\tilde{n}(t_s)}{n(t_s)}. \tag{a.9}$$

It is clear that the variance to mean ratio of the prior has a
strong influence on the effect of a prior estimate of the
mean.
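The mode equation (a.7) can be solved numerically in the same way as (3.11). The sketch below is an assumption of this illustration, not a method stated in the thesis: it applies a fixed-point iteration by analogy with Chapter III, and the function names, prior settings, and noise-free counts are hypothetical.

```python
import math


def posterior_mode_x(counts, alpha, beta, tol=1e-12, max_iter=100_000):
    """Solve (a.7) for x = exp(-mu * delta) by fixed-point iteration."""
    J = len(counts)
    n = sum(counts)                                      # n(t_s)
    r1 = sum(j * c for j, c in enumerate(counts)) / n    # n~(t_s) / n(t_s)
    x = r1 / (1.0 + r1)
    for _ in range(max_iter):
        xJ = x ** J
        r = r1 + ((n + beta) / n) * J * xJ / (alpha + 1.0 - xJ)
        x_new = r / (1.0 + r)
        done = abs(x_new - x) < tol
        x = x_new
        if done:
            break
    return x


def mode_residual(x, counts, alpha, beta):
    """Left side minus right side of (a.7); zero at the posterior mode."""
    J, n = len(counts), sum(counts)
    r1 = sum(j * c for j, c in enumerate(counts)) / n
    return x / (1.0 - x) - ((n + beta) / n) * J * x ** J / (alpha + 1.0 - x ** J) - r1


# Noise-free counts from assumed parameters (mu * delta = 0.1, lambda = 100).
J = 30
counts = [100.0 * math.exp(-0.1 * j) * (1.0 - math.exp(-0.1)) for j in range(J)]
x_flat = posterior_mode_x(counts, alpha=0.0, beta=0.0)       # reduces to (3.11)
x_prior = posterior_mode_x(counts, alpha=1.0, beta=100.0)    # Poisson-like prior, m = 100
```

With $\alpha = \beta = 0$ the result coincides with the maximum likelihood value of $x$; a prior with $\alpha = 1$ corresponds to the case $m = \sigma^2$ of (a.9).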

One Bayesian approach to estimation is to find the mean
(rather than the mode, or highest point of the posterior, as is
essentially done in the likelihood approach) of the
(approximate) posterior, $0 \le x \le 1$. To obtain an approximate
posterior mean proceed as follows. If $J$ is large, $x^J$ is small
provided $x < 1$, so expand in a Taylor series to get

$$p_x(x) = K''\left[x^{\tilde{n}}(1-x)^{n} + \left(\frac{n+\beta}{1+\alpha}\right)x^{\tilde{n}+J}(1-x)^{n}\right], \tag{a.10}$$

where $n = n(t_s)$ and $\tilde{n} = \tilde{n}(t_s)$.

Equation (a.10) is a convex combination of two beta densities.
$K''$ can be found by requiring (a.10) to integrate to 1 over $0 \le x \le 1$.
$E[x] = E[e^{-\mu\Delta}]$ can then be found:

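$$E[x] = \frac{\dfrac{\Gamma(\tilde{n}+1)\,\Gamma(n+1)}{\Gamma(\tilde{n}+n+1)}\cdot\dfrac{\tilde{n}+1}{\tilde{n}+n+1} + \left(\dfrac{n+\beta}{1+\alpha}\right)\dfrac{\Gamma(\tilde{n}+J+1)\,\Gamma(n+1)}{\Gamma(\tilde{n}+J+n+1)}\cdot\dfrac{\tilde{n}+J+1}{\tilde{n}+J+n+1}}{\dfrac{\Gamma(\tilde{n}+1)\,\Gamma(n+1)}{\Gamma(\tilde{n}+n+1)} + \left(\dfrac{n+\beta}{1+\alpha}\right)\dfrac{\Gamma(\tilde{n}+J+1)\,\Gamma(n+1)}{\Gamma(\tilde{n}+J+n+1)}}. \tag{a.11}$$

The approximation to this is

$$E[x] \approx \frac{\dfrac{\tilde{n}!}{(\tilde{n}+n)!}\cdot\dfrac{\tilde{n}+1}{\tilde{n}+n+1} + \left(\dfrac{n+\beta}{1+\alpha}\right)\dfrac{(\tilde{n}+J)!}{(\tilde{n}+J+n)!}\cdot\dfrac{\tilde{n}+J+1}{\tilde{n}+J+n+1}}{\dfrac{\tilde{n}!}{(\tilde{n}+n)!} + \left(\dfrac{n+\beta}{1+\alpha}\right)\dfrac{(\tilde{n}+J)!}{(\tilde{n}+J+n)!}}. \tag{a.12}$$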


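Unfortunately, $n = n(t_s) = 136$ for Data Set 1; even with factoring
out $\tilde{n} = \tilde{n}(t_s)$, the factorial ratios are on the order of 10'“°.
However, it is justifiable to use an approximation to the
factorials to get

$$E[x] \approx \frac{\dfrac{\tilde{n}+1}{\tilde{n}+n+1} + \left(\dfrac{n+\beta}{1+\alpha}\right)\dfrac{\tilde{n}+J+1}{\tilde{n}+n+J+1}\left(\dfrac{\tilde{n}+1}{\tilde{n}+n+1}\right)^{J}}{1 + \left(\dfrac{n+\beta}{1+\alpha}\right)\left(\dfrac{\tilde{n}+1}{\tilde{n}+n+1}\right)^{J}}. \tag{a.13}$$
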
The numerical results of equation (a.13) are in Tables A1
through A6 for Data Sets 1 and 2. The graphical results are
shown in Figures A1 through A6. The range of the estimated
number of faults to occur in $(t_s, t_s + t_o)$ is much smaller than
that of the bootstrap results discussed in Chapter III. None
of the ranges (estimated number of faults to occur) obtained using the
Bayesian method contains the observed number of faults. A possible
explanation for this is inappropriate values for $\alpha$ and $\beta$
($\alpha = \beta = 0$). After various projects have been analyzed with
software reliability models, fault distributions may become
more apparent. This information can then be incorporated into
reliability models. I feel that, despite the surprising
initial results, this method does promise to be a useful tool
to the software manager.

TABLE A1

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 1250$, $t_o = 250$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 6
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00339   0.00263     0.00340   0.00264     0.00339   0.00262
$\lambda$          131.892   135.028     131.870   134.992     131.914   135.103
$E[N(t_o)]$           1.08      2.42        1.10      2.48        1.09      2.45

TABLE A2

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 1000$, $t_o = 500$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 14
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00399   0.00311     0.00399   0.00310     0.00398   0.00309
$\lambda$          124.289   127.719     124.304   127.768     124.328   127.828
$E[N(t_o)]$           1.98      4.51        1.99      4.54        2.01      4.58

TABLE A3

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 1
$t_s = 500$, $t_o = 1000$ (CPU MINUTES)
Observed number of bugs in $t_o$ is 46
90% Confidence Interval

$\Delta$ (CPU min)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00808   0.00602     0.00809   0.00601     0.00797   0.00596
$\lambda$           91.608    94.660      91.603    94.697      91.708    94.805
$E[N(t_o)]$           1.61      4.65        1.60      4.69        1.71      4.79

TABLE A4

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 800$, $t_o = 300$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 12
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00464   0.00340     0.00464   0.00340     0.00465   0.00339
$\lambda$           75.849    79.192      75.846    79.216      75.837    79.242
$E[N(t_o)]$           1.39      3.32        1.39      3.34        1.38      3.35

TABLE A5

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 600$, $t_o = 500$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 21
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00600   0.00429     0.00600   0.00429     0.00596   0.00429
$\lambda$           66.830    70.363      66.828    70.361      66.872    70.368
$E[N(t_o)]$           1.74      4.74        1.74      4.73        1.78      4.74

TABLE A6

BAYESIAN ESTIMATE OF PARAMETERS FOR DATA SET 2
$t_s = 400$, $t_o = 700$ (CPU SECONDS)
Observed number of bugs in $t_o$ is 37
90% Confidence Interval

$\Delta$ (CPU sec)         10                    20                    50
                      5%        95%         5%        95%         5%        95%
$\mu$              0.00897   0.00625     0.00893   0.00624     0.00891   0.00622
$\lambda$           50.391    53.386      50.416    53.397      50.426    53.432
$E[N(t_o)]$           1.39      4.33        1.41      4.34        1.42      4.38

Figure A2. Data Set 1, $\Delta$ = 20 (CPU minutes), Bayesian Method
(horizontal axis: time in CPU minutes)

Figure A4. Data Set 2, $\Delta$ = 10 (CPU seconds), Bayesian Method
(vertical axis: cumulative number of faults)

Figure A5. Data Set 2, $\Delta$ = 20 (CPU seconds), Bayesian Method
(horizontal axis: time in CPU seconds; vertical axis: cumulative number of faults)